Author: Mohammad AliAbadi

License: Shahid Beheshti University

aliabadi4mohammad@gmail.com

SGD

PCA

SMOTE

ADASYN

Tomek links

SMOTETomek

SGD classifier with SMOTETomek

SGD classifier with ADASYN

Logistic regression with SMOTETomek

Test

Voting

Feature Importance

Loading Libraries

I define a few helper functions to make the analysis more convenient and presentable.

Loading data

Benchmark model

Overall feature importances

Default Scikit-learn's feature importances

Let's start with decision trees to build some intuition. In a decision tree, every node is a condition on a single feature, chosen to split the data so that similar values of the dependent variable end up in the same set after the split. The condition is based on impurity: for classification problems this is Gini impurity or information gain (entropy), while for regression trees it is variance. So when training a tree, we can compute how much each feature contributes to decreasing the weighted impurity. feature_importances_ in Scikit-learn is based on that logic; in the case of Random Forest, the decrease in impurity is averaged over the trees.
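As a minimal sketch of what this looks like in practice (the dataset here is synthetic, not the one used in this notebook):

```python
# Sketch: impurity-based feature importances from a Random Forest.
# The synthetic dataset below is illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Mean decrease in Gini impurity, averaged over all trees; normalised to sum to 1.
importances = rf.feature_importances_
print(importances)
```

Note that these importances are computed entirely from the training process, with no held-out data involved.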

Pros:

Cons:

It seems that the top 3 most important features are:

What seems surprising though is that a column of random values turned out to be more important than:

Intuitively this feature should have zero importance on the target variable. Let's see how it is evaluated by different approaches.
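The random-column sanity check itself is easy to reproduce; a hedged sketch on synthetic data (the column names and dataset are illustrative, not from this notebook):

```python
# Sketch: append a column of pure noise before fitting. If the model ranks
# the noise column above real features, the importance measure (or the
# model's overfitting) deserves scrutiny.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=4, random_state=0)
rng = np.random.RandomState(42)
X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], 1))])  # last column = noise

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_noisy, y)
print(rf.feature_importances_)  # the last entry is the noise column's importance
```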

Permutation feature importance

This approach measures feature importance directly, by observing how randomly re-shuffling each predictor (which preserves the variable's distribution) influences model performance.

The approach can be described in the following steps:

  1. Train the baseline model and record its score (accuracy/R^2/any metric of interest) on a validation set (or the OOB set in the case of Random Forest). This can also be done on the training set, at the cost of sacrificing information about generalisation.
  2. Re-shuffle the values of one feature in the selected dataset, pass the modified dataset to the model to obtain predictions, and calculate the metric again. The feature importance is the difference between the benchmark score and the score on the modified (permuted) dataset.
  3. Repeat step 2 for every feature in the dataset.
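The steps above can be sketched directly in a few lines (again on an illustrative synthetic dataset):

```python
# Sketch of permutation importance, following the three steps above.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=4, noise=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
baseline = r2_score(y_val, model.predict(X_val))   # step 1: benchmark score

rng = np.random.RandomState(0)
importances = []
for j in range(X_val.shape[1]):                    # step 3: repeat for every feature
    X_perm = X_val.copy()
    rng.shuffle(X_perm[:, j])                      # step 2: reshuffle one column
    importances.append(baseline - r2_score(y_val, model.predict(X_perm)))
print(importances)
```

Scikit-learn ships this as `sklearn.inspection.permutation_importance`, which also averages over several reshuffles per feature.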

Pros:

Cons:

As for the second problem with this method, I have already plotted the correlation matrix above. However, I will use a function from one of the libraries I use to visualise Spearman's correlations. The difference from standard Pearson's correlation is that Spearman's first transforms the variables into ranks and only then runs Pearson's correlation on the ranks.

Spearman's correlation:
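That rank-then-correlate definition is easy to verify; a small sketch (synthetic data, illustrative names):

```python
# Sketch: Spearman's correlation is Pearson's correlation computed on ranks.
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
x = rng.normal(size=100)
y = x ** 3 + rng.normal(scale=0.1, size=100)   # monotone but non-linear relation

manual = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))[0]
builtin = stats.spearmanr(x, y)[0]
print(manual, builtin)  # the two values agree
```

Because Spearman's only looks at ranks, it captures the strongly monotone (but non-linear) relation here better than raw Pearson's would.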

I found two libraries with this functionality (not that it would be difficult to code by hand). Let's go over both of them, as each has some unique features.

eli5

There are a few differences from the basic approach of rfpimp and the one employed in eli5. Some of them are:

The results are very similar to the previous ones, even though these came from multiple reshuffles per column.

The default importance DataFrame is not the most readable, as it does not contain variable names. This can, of course, quite easily be fixed. The nice thing is the standard error computed from all iterations of the reshuffling of each variable.

One extra nice thing about eli5 is that it is really easy to use the results of permutation approach to carry out feature selection by using Scikit-learn's SelectFromModel or RFE.
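To illustrate permutation-based feature selection without depending on eli5 (whose `PermutationImportance` can be passed straight to `SelectFromModel`), here is a hedged scikit-learn-only sketch of the same idea; the dataset and the 0.01 threshold are illustrative:

```python
# Sketch: keep only features whose permutation importance exceeds a threshold.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
result = permutation_importance(clf, X_val, y_val, n_repeats=10, random_state=0)

keep = result.importances_mean > 0.01          # illustrative threshold
X_selected = X_tr[:, keep]
print(keep.sum(), "features kept out of", X_tr.shape[1])
```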

LIME

LIME (Local Interpretable Model-agnostic Explanations) is a technique for explaining the predictions of any classifier or regressor in an interpretable and faithful manner. To do so, an explanation is obtained by locally approximating the selected model with an interpretable one (such as a regularised linear model or a decision tree). The interpretable model is trained on small perturbations (added noise) of the original observation (a row, in the case of tabular data), so it only provides a good local approximation.
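The core idea can be sketched without the lime library itself: perturb one row, query the black-box model, and fit a proximity-weighted linear surrogate. Everything below (dataset, kernel width, Ridge surrogate) is illustrative, not the lime package's exact implementation:

```python
# Sketch of LIME's core idea: a local, weighted linear surrogate.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=500, n_features=4, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

row = X[0]                                              # the observation to explain
rng = np.random.RandomState(0)
Z = row + rng.normal(scale=0.5, size=(200, len(row)))   # local perturbations
preds = model.predict(Z)                                # black-box predictions

# Weight perturbed samples by proximity to the explained row (RBF kernel).
weights = np.exp(-np.sum((Z - row) ** 2, axis=1) / 2.0)
surrogate = Ridge(alpha=1.0).fit(Z, preds, sample_weight=weights)
print(surrogate.coef_)  # local feature impacts, positive or negative
```

The surrogate's coefficients play the role of the per-feature impacts shown in LIME's output below.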

Some drawbacks to be aware of:

Below you can see the output of LIME interpretation.

There are 3 parts to the output:

  1. Predicted value
  2. Feature importance - in the case of regression it shows whether the feature has a negative or positive impact on the prediction, sorted by descending absolute impact.
  3. Actual values of these features for the explained rows.

Note that LIME has discretized the features in the explanation. This is because discretize_continuous=True was set in the constructor above. The reason for discretization is that it gives more intuitive explanations for continuous features.